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Abstract 

We use a constant velocity steered molecular dynamics (SMD) simulation of the stretching of 
deca-alanine in vacuum to demonstrate a technique that can be used to create surrogate stochastic 
processes using the time series that come out of SMD simulations. The surrogate processes are 
constructed by first estimating a sequence of local parametric models along a SMD trajectory and 
then a single global model is constructed by piecing the local models together through smoothing 
splines (estimation is made computationally feasible by likelihood function approximations). The 
calibrated surrogate models are then "bootstrapped" in order to simulate the large number of work 
paths typically needed to construct a potential of mean force (PMF) by appealing to Jarzynski's 
work theorem. When this procedure is repeated for a small number of SMD paths, it is shown that 
the global models appear to come from a single family of closely related diffusion processes. Possible 
techniques for exploiting this observation are also briefly discussed. The findings of this paper have 
potential relevance to computationally expensive computer simulations and experimental works 
involving optical tweezers where it difficult to collect a large number of samples, but possible to 
sample accurately and frequently in time. 

PACS numbers: 



I. INTRODUCTION 



The truncated Taylor series of a nonlinear function is probably the most widely known example 
of a local model approximation. The work in ^ demonstrated some possible extensions of this 
basic concept to state-space diffusion models that aimed at utilizing the desirable features of a finite 
dimensional parametric estimator, but at the same time retained the flexibility associated with a 
"Taylor series". In this paper we demonstrate how a local diffusion modeling approach can be 
applied to time-inhomogeneous, state-dependent noise processes. The goal is to obtain a stochastic 
differential equation (SDE), with nonlinear coefficient functions, that accurately describes the 
dynamics of a single trajectory associated with the time series output of a steered molecular 
dynamics (SMD) simulation. The system used to illustrate the approach is the well-studied example 
of using constant velocity SMD to simulate the unravelling (in vacuum) of deca-alanine at constant 
temperature (extensive simulation results exist for this system because it is a "fast-folder" P]). 

A common objective of many of these types of SMD simulations is to derive a single low- 
dimensional effective diffusion equation valid at mesoscale times using a batch of time series of 

nriQ 

atomistic simulation output U, |5|, |q{ . This type of multiscale modeling approach often requires 
one to compute the potential of mean force (PMF) associated with the system. One technique for 
calculating the PMF associated with a set of well selected reaction coordinates 0] is to appeal to the 
Jarzynski work theorem j^] using the path dependent work (measured from the SMD simulations) in 
order to reconstruct the PMF associated with the selected reaction coordinate(s) of the system. To 
do this, one typically needs to assume variety of things about the system a priori Qjl^ |; even if all of 
the assumptions hold true for the selected reaction coordinate(s) of the system, calculating the PMF 
can be difficult in practice due in part to the number of sample paths needed in order to accurately 
approximate the PMF from atomistic simulation trajectories 0,0]. A large number of paths are 
needed because realizations with a low probability of occurrence can cause a large influence on the 
.-•MF (which depends on an exponential average) computed from a finite sample of realizations 
^. The requirement of a large number of atomistic simulation trajectories is problematic because 
current computational technology limits the amount of data that can be generated in a reasonable 
amount of CPU time. As a result, numerous schemes 0,0,01 have been developed for efficiently 
sampling from SMD simulations (the cited works focus on making efficient use of ensemble data 
created from a batch of genuine SMD runs). 
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The main focus in this work is to approximate and characterize the statistical properties of SMD 
simulations on a pathwise basis and then repeat the procedure over a small number of (frequently 
sampled) genuine SMD time series. It should be stressed that we determine a global diffusion 
process using the time series associated with a single SMD run; the information contained in a 
batch of SMD processes is not pooled together to determine a single diffusion process (each SMD 
process realization results in the estimation of a new diffusion model). If the surrogate process's 
dynamics are close to that of the genuine process then we have a means for generating the large 
number of (approximate) work paths typically needed in the Jarzynski work relationship. The large 
number of samples needed can come from bootstrapping the estimated models (that is, estimate 
the model parameters using actual SMD runs and then use the estimated models to give additional 
samples by using different random number sequences). The bootstrapped surrogate models used 
here are computationally cheap (relative to the SMD simulations) diffusion SDEs. 

The purpose of the diffusion models we estimate from SMD data is not to obtain equations from 
atomistic data valid at mesoscopic timescales. The intention is to approximate the work distribution 
associated with SMD simulations using simple diffusion models in hopes of constructing a surrogate 
process which can be used to replace the computationally expensive SMD runs. Calibrating a 
diffusion equation valid at meso- or even macroscopic timescales directly from atomistic data is 
problematic because the reaction coordinate description of the process neglects several details of 
the underlying system. It becomes difficult to quantify how the model imperfections and the errors 
introduced by finite sample estimation interact and ultimately affect the mesoscale equation. The 
local diffusion models we use are motivated by the overdamped limit of the Langevin equation. 
Throughout we operate in a regime that is intermediate to the under and overdamped limits; we 
simply find the "projection" (through estimation) of the data onto this particular diffusion model 
class. In thispaper we demonstrate our local estimation technique by modeling the dynamics of an 
established |a| reaction coordinate (z = the end-to-end distance of the deca-alanine molecule) using 
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SMD trajectories that were generated by the NAMD program In Section we demonstrate 
that we can approximate the work distributions fairly accurately (the accuracy is sufficient to closely 
reproduce the PMF of the system) which gives partial evidence that the overdamped approximation 
is reasonable for the system conditions studied (we later give additional statistical evidence which 
further justifies our overdamped approximation). 

Furthermore the dynamics of the reaction coordinate alone are not likely Markovian (which 
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is implicitly assumed in om' diffusion model) due in part to some of the reasons mentioned in 
the previous paragraph. The Jarzynski work relationship requires that all of the degrees of 
freedom of the system collectively obey Markovian dynamics (the work relationship can still be 
valid if the evolution of the reaction coordinate alone is non-Markovian, e.g. if the dynamics of 
the full system come from a Hamiltonian i49|]). If one could somehow reliably model the dynamics 



resulting from the interaction of the "heat bath" 



soi l with the reaction coordinate, then one would 



have more faith in appealing to a Markovian description of the reaction coordinate dynamics. We 
merely lump all of the effects of the heat bath into a Brownian motion (with state dependence on 
the local noise magnitude); the hope is that the surrogate model is good enough to approximate 
work distributions, but the simple nature of the model casts doubt on the validity of extrapolating 
the estimated models to larger length and time scales. The surrogate model we introduce can 
potentially be plugged into other schemes which aim at indirectly approximating the overdamped 
limit equation P, |^ (with the intention of determining a diffusion that is valid at meso- or even 
macroscopic time scales), but we stress (again) that the estimated models presented here should 
not be used directly to ambitiously extrapolate to larger length and time scales. 



The use of the probability integral transform (PIT) 



IJ] using the simulated maximum likelihood 



(SML) approximation of the transition densities associated with a global model constructed by 



piecing together 
fit test statistics 



ocal diffusion models is also investigated. The PIT is used to create goodness-of- 
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associated with the models calibrated from the data of our nonstationary SMD 
processes. The PIT allows one to judge the statistical validity of the assumed stochastic model 
given data generate d by the true process. The PIT can sometimes even offer insight concerning 

, 3|- In Section IVTl we give quantitative evidence which suggests that the 



model inadequacies 
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errors of the overdamped approximation we impose on the model are small in comparison to the 
errors associated with summarizing the dynamics of the system with a scalar end-to-end distance 
reaction coordinate (using a simple diffusion model). 

A major finding of this work is that in some systems it is possible to approximate the work 
distributions needed to compute the PMF from a relatively small number of SMD trajectories by 
appealing to estimation along individual sample paths and using the calibrated surrogate models 
to generate the large number of "synthetic" trajectories needed in standard applications of con- 
structing a PMF by using the Jarzynski nonequilibrium work relationship. It is shown (for the 
particular system studied) that the estimated global diffusion approximations associated with the 



different SMD trajectories result in what appears to be a single family of closely related nonlinear 
diffusion processes. In systems where the reaction coordinate has more complicated interactions 
with "the surroundings" (e.g. environmental asymmetries 

[2^) there is no guarantee that the diffusion approximations will belong to a single family. How 
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19| or jump-like transitions 



2ii 



ever, the findings reported here indicate that it may be possible to appeal to empirical Bayes 
and growth curve analysis 2^ techniques in order to approximate the SMD process. This would 
be computationally tractable if the SMD process under study can be adequately characterized by 
a relatively small number of "diffusion families" . This point is demonstrated using a toy model in 
Section rVIII The findings of this paper have potential relevance to computationally expensive com- 
puter simulations and experimental works where it difficult to collect a large number of samples, 



but possible to sample accurately and frequently in time P, ^, l20|. 

The remainder of the paper is organized as follows: Section ^ reviews the local modeling 
approach used. Section lllll reviews some basic facts about the statistical tools that we use for 
estimation and inference. Section HVI gives the equations that we use to estimate a PMF using our 
"synthetically" created data. In Section we briefly report some of the computational details. 
Section IVII gives our results; Section IVIII contains a discussion and presents a toy example which 
aims at demonstrating how the findings in this paper may be exploited; we then conclude. 
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II. LOCAL PARAMETRIC MODELING APPROACH 

We attempt to fit a global SDE of the generic form: 

dXt = b{Xt,t; e)dt + aiXt; Q)dWt. (1) 

to the output of a single SMD simulation (other independent SMD runs result in new SDEs) by 
appealing to estimation techniques related to maximum likelihood (ML). In the above equation, 
Xt corresponds to the reaction coordinate whose dynamics are being approximated, Wt is the 
standard Brownian motion, 6(-,-;0) and cr(-; 0) are the assumed drift and diffusion coefficient 
functions (parameterized by B), and all SDEs correspond to Ito integrals. The process above is 
referred to as the "diffusion process" in the sequel. 

One typically does not know a priori a parametric family of functions that the drift and dif- 
fusion coefficients belong to that can adequately describe the global dynamics (it should be noted 
that the Q "parameterization" is not a traditional Euclidean parameter in our global models). A 
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nonparametric approach j23, |2J] can sometimes successfully overcome this difficulty, but nonpara- 
metric estimators are typically not as efficient (asymptotically) as the finite dimensional parametric 
ML estimator. More importantly, a parametric framework allows us to impose a smooth structure 

icated structure associated with fast timescales. To do this 



48l |: we present methods that can quantitatively assist one in 



that "ignores" the often more comp^ 
with confidence, care must be taken 
this type of modeling. 

In this paper, we propose a local finite dimensional parametric estimator based on linear func- 
tions (these functions are then used to construct the drift and the diffusion coefficients). The 
nature of the time-dependent forcing term is typically prespecified in SMD simulations [a] which 
greatly facilitates selecting the functional form of our local models . In the constant velocity SMD 
application studied, our local models take the form: 



dXt =b^^^{Xt, t; e)dt + a^^^iXt; e)dWt 

aLOC(X; e) =V2(C + D{X - Xo) 
LOG 



\X,t;e) =(^A + B{X-Xo))+kp^n{x"''''{t)-Xy 

(2) 

In the above, Xo is a specified point at which the local model is centered and the parameter 
vector estimated is 9 = {A, B,C, D); the elements of this vector correspond to the constants used 
for a local approximation of the nonlinear global (assumed deterministic but unknown) functions 
(t(-) and b{-, •); A;puii is the harmonic constraint used in the constant velocity SMD simulation and 
X*°^^(t) is a deterministic function (in this study, X^^'^^{t) := X^'^ +Vpu\\t, where Xjc and Vpuii are 
specified constants). The constant ^ = ksT was set to 0.6 in reduced units (using 1 kcal/mol as 
the energy scale) where ks corresponds to Boltzmann's constant and T is the system temperature 
(300 K maintained by a Langevin heat bath). These local models are estimated at points (Xq) 
selected in an ad hoc manor (an optimal strategy for picking these points should be investigated 
in the future) and a global model is obtained by connecting the estimated constants A and C 
at the various Xo's selected using smoothing splines j^l (MATLAB's [csaps| function was used 
for spline smoothing). It should be noted that one could also develop interpolation schemes that 
utilize the information contained in the estimated parameters B and D (e.g. see Q|); this was not 



done in order to keep the presentation streamlined. It should also be noted that although we deal 
exclusively with the scalar case, the techniques presented here can trivially be extended to the 
case of multivariate time-inhomogeneous, state-dependent noise systems (which would be relevant 
to SMD experiments which require both the end-to-end separation of a molecule as well as some 
dihedral angle(s) ||2^] to adequately characterize the system). The major computational obstacle 
in the multivariate case is the number of parameters to be estimated (which can grow rapidly as a 
function of the state dimension). 

The most questionable assumption of the technique presented here is that the dynamics of the 
reaction coordinate associated with different SMD trajectories can be well-represented by simple 
diffusion processes. One potential way of incorporating a more realistic noise process is briefly 
discussed in Section fVIII In addition, the local process approximation technique could in principle 
be extended to jump-diffusion processes 45], but many fundamental estimation issues still need 
to be resolved before one attempts to use local models to construct global time-inhomogeneous 
jump-diffusion models that can be used to reliably approximate SMD trajectories. 

Unfortunately, even for our overly simple local diffusion models, an analytic expression for the 
transition density needed for a ML estimator of the local model is not even available in closed-form 
(when a complicated function like a spline is used for the coefficient functions of the global SDE, 
an analytic expressions seems hopeless to obtain). To overcome these difficulties, we appeal to 



the SML estimator 



in order to carry out estimation and inference (reviewed in Section 
IIII() . This method only gives a noisy approximation of the transition density needed for a ML 
type estimator; other approximations are possible (e.g. a deterministic extension of Ai't-Sahalia's 



expansion |29|). One appealing feature of a deterministic method is that the computation time 
needed to find the optimal parameter vector associated with a time series is typically much less than 
that associated with simulation based methods. The parametric estimator we propose does not 



of the conditions needed to ensure convergence of the time-dependent Hermite expansion 
Specifically, the proposed linear diffusion coefficient can take a value of zero which 



satisfy al 
given in 

can cause problems in the Hermite expansions 
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Even for models that do satisfy all of the 



conditions, a truncated expansion can yield improper density expansions which causes additional 
difficulty in estimation. Furthermore, one would probably not be able to adapt the method in [2^ 
to account for the spline coefficients we use for our global model whereas the SML method readily 
allows one to use these types of coefficient functions in the SDE model. 



When one uses a spline for the global functions in the SML, one does violate some of the 
assumptions presented in {27! that guarantee convergence of the SML approximation. Specifically 
infinite differentiability of the coefficient functions is not typically satisfied when one uses standard 
splines, but as mentioned in [^^l the infinite differentiability assumptions is probably not necessary 
to guarantee convergence of the approximation (coincidentally these assumptions are not violated 
in the local parametric models proposed). 



III. STATISTICAL TOOLS 



Maximum Likelihood Estimation 



We now recall a few basic facts about ML estimation; some standard references include 32 



33, 



34 



It is assumed throughout that the exact distribution associated with the parametric 
model admits a continuous density whose logarithm is well defined almost everywhere and is at 
least three times continuously differentiable with respect to the parameters js^. ML estimation is 
based on maximizing the log-likelihood (Cq) with respect to the parameter vector (9): 

£,^log(/(x;0)). (3) 



In the above equation, x corresponds to a matrix of observations G M'^^*''^ where d is the 
dimension of the state and M is the length of the time series; /(x; 9) corresponds to the probability 
density associated with observation x. For a single sample path of a discretely observed diffusion 
known to be initialized at xq, /(x; 9) can be evaluated as |32| : 

M-l 

f{x;9) = 6^^ll f{^m\^^.i;9). (4) 

m=l 

In this equation /(x^lxm-i;^) represents the conditional probability density (transition den- 
sity) of Xm given the observation x^-i and (^xo is the Dirac distribution. The associated log- 
likelihood (given 9, the data, and the transition density) takes the form: 

AI 

^9 ■= X] (/(Xm|Xm_i; 61)^ . (5) 

m=l 

We assume the existence of an invertible symmetric positive definite "scaling matrix" matrix 



^{M, e) |23l associated with the estimator; the subscripts are used to make the dependence of the 
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scaling matrix on M and 9 explicit. Under some additional regularity assumptions 
has the following limit for a correctly specified finite dimensional parametric model: 



33, 



one 



(6) 



Here 9 is the true parameter vector; 9m represents the parameters estimated with a finite 
time series of length M; =4> denotes convergence in distribution 2^ under where the 
aforementioned distribution represents that associated with the density f{x;9); M{0,I) denotes 
a normal distribution with mean zero and an identity matrix for the covariance. For a correctly 
specified model family, J'f^j^j g-^ can be estimated (numerically) in a variety of ways 
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. The 



appeal of ML estimation lies in that, asymptotically in M, the variance of the estimated parameters 
is the smallest that can be achieved by an estimator that satisfies the assumed regularity conditions 
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3J]. Our motivation for using a sequence of simple parametric local models comes from this 
asymptotic efficiency. 

Unfortunately exact analytic expressions for how Tf^j^^ scales asymptotically with M are 
typically difficult to determine for general nonstationary time series models. Throughout we simply 
assume that the scaling matrix associated with the local models proposed using the approximated 
transition density is close to the efficiency associated with the ML estimator. 



B. The SML Expansion 



Transition density expansions have been an active area of research in recent years. A vari- 
ety of methods have been introduced that aim at overcoming the fact that the transition den- 
sity associated with a general parametric diffusion model is typically unavailable in closed-form 
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3C, 
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42l |. It is well known 



43] that the simple intuitive "Euler estimator" (moti- 



vated by the transition density associated with the Euler-Maruyama scheme |4J|) yields significant 
bias in parameter estimates. The Euler-Maruyama discretization (using a time step of size At) 
corresponding to our local SDE is given by: 



Xtm =Xt^^,+biXt^_„tm-i;9)At 



(7) 
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Where m = 1 . . . , M, Xq is the exactly known initial condition, and {%}i=o,M-i is a sequence 
of standard normal random variables. The transition density (given below) associated with one 
step of the scheme above is what we refer to as the Euler estimator; the random variable Xt^ 
conditional on the value of ^t„_i is denoted by X^^IX^^ ^ and the Euler approximation of its 
density is given by: 



X. \X, 



(8) 



~ M[Xt^_, + biXt^_, , 0) At, aiXt,^_, ; Of At 

The symbol ~ denotes that the random variable on the left of the symbol is distributed with 
the law to the right; AA(/i,cr^) corresponds to a normal law with mean n and variance cr^ (so 
the transition density can be readily evaluated analytically). The Euler-Maruyama discretization 
does not correspond exactly to the assumed parametric SDE model and the difference can cause 
significant problems in estimation 
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\4$. In 123, 



28l | it is proved that an extension of the Euler 



estimator idea can yield a fairly reliable estimator. The SML approximation of the transition 
density associated with the observation pair {Xt^, Xt^_j^) is obtained by specifying two additional 
parameters Tsmi and Ng^y, these parameters are used to define the simulation parameter 5t = 

^ sml 

and 



' m' 'm'—l 'm'—l 

Where m' = 1 . . . , Tsmi — 1 and n = 1 . . . , Ng^y, in what follows the actually observed transition 
pair is separated by a time of length At, but the time indices used by the SML trajectories are 
given by r^/ := tm-i + m'St. The distribution associated with each SML simulation path is given 
by: 

(XtjXtX^Affx-^ ^_^+b{X-^ TT^^^^^;9)6t,a{X-^ ■,ef6t). (10) 

\ / \ sml sml sml / 

The transition density above is evaluated for a given (Xt^, Xt^^^^) using Ng^i paths and its 
value along path n is denoted by p^{Xt^\Xt^_-^;6). The transition density of the observation pair 
associated with the original time series is obtained by appropriately averaging over the Ng^i paths 
yielding the following approximation of the log likelihood function: 

M Afsml 

m=l n=l 
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The SML estimator is appealing because it is simple to implement in a computer program, but 
for the sample sizes used in SMD applications its computational cost can be a significant drawback. 
The optimal parameter vector using a Nelder-Mead search algorithm for a representative local 
model (repeated for ten SMD trajectories) took roughly one day on a PC with a 3.4 GHz Pentium 
IV processor which had 1 GB of RAM. The estimation task associated with multiple trajectories 
can obviously be trivially parallelized and the work load associated with this optimization is much 
less than that associated with typical SMD simulations, however a detailed statistical analysis of 
the data is greatly hindered (in regards to computation time) by a simulation based approach such 
as SML. Other simulation techniques are currently available, such as Gallant and Tauchen's EMM 
estimator [4^, however alternative approximations were not explored in this study. We quantify 
the bias introduced (and show it is significant) by the naive Euler estimator in Section IVTl but it 
should be noted that a deterministic optimization can be completed in less than an 10 minutes (on 
the same computing platform). This fact will hopefully motivate additional research into reliable 
general deterministic time-inhomogeneous likelihood expansion techniques. 



C. The probability integral transform 

The probability iiitegral transform (PIT) is a powerful tool which can be used in a variety of 
time series contexts (l5|. The technique requires one to have the ability to evaluate the transition 
density associated with the assumed model. In parametric diffusion modeling applications, previous 
studies have shown that various transition density expansions can be reliably used in place of the 
exact transition density to approximate the PIT. The PIT is a random variable (Zm) which is 

constructed from an observation pair and the assumed (or empirically measured) transition density. 



The scalar case is easiest to demonstrate (the multivariate is a fairly straightforward extension 
and the transformation is shown below: 



q{Zm;0) = 



dZrt 



Xm J [Xm\Xm-l) = (12) 

dXjji 

Under a correctly specified model, the Z^'s are uniformly distributed on the interval [0, 1] and 



independent of one another 



The transition density of the true process, f{xm\xm-i), may 
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not be adequately described by the assumed model class (whose transition density here is denoted 
by p{x 

m\^m—i')d) )• If this is the case, q{Z.m',9) will either not be a uniform density due to the 
discrepancy between f{xm\xm-i) and p{xm\xm~i', (^) and/or the {Zm} series constructed from the 
data and the assumed model will exhibit serial correlation A suite of statistical tests have 

)een developed which exploit these two facts; here we utilize the M— test statistic introduced in 
2|- To construct it one needs to first construct the series {Zm}m=i from the series {xm}m=o 
(for our applications this can be readily done once the global spline function is in hand). The 
test-statistic {M{m,l)) is given by the following formula: 

^(-' - -r^ • (13) 

V i=l ^ 

Where in the above pl^i{j) is the sample cross-correlation between and Z\._- (here the 
superscripts correspond to powers and not to time series indices) and w{-) is a weighting function 
for the lag orders j ^|. We make use of the Bartlett kernel for the weighting function {w{z) := 
(1 — |z|)l|2|<i, where 1a represents the indicator function for "event A" and p is a prespecified lag 
truncation parameter). 



IV. ESTABLISHED TECHNIQUES FOR ESTIMATING THE PMF 



Under appropriate assurngtions 
recent fluctuation theorems 



the Jarzynski work relationship (which is one of several 
allows one to calculate equilibrium properties from nonequilibrium 
simulations. It has received a lot of attention in biophysics due to its potential relevance to a 
wide variety of systems including protein folding, ion-channels, and wet lab experiments involving 
optical-tweezers and atomic force microscopes 



, ion- 



The purpose of this article is to attempt to develop a technique which can provide the large 
amount of synthetic data (obtained through estimating models based on the genuine process) 
needed for reliably estimating an exponential average of a random quantity. We will only give the 
equations needed to construct a PMF through exponential averages of the work done on the system. 

nnPQ 

A detailed treatment of the subject can be found in a variety of sources including |J,|a, 13)131 • The 



12 



Jarzynski work relation is as follows: 

' exp i-pW) ) = exp {-/3AG) (14) 



Where in the above W is the work done on the system and AG is the free energy difference 
between two equilibrium states. The work introduced changes the system from known equilibrium 
conditions characterized by the reaction coordinate(s) in "state 1" to value(s) associated with "state 
2" (for our problem the state is simply the numerical values of the linear end-to-end distance of the 
deca-alanine molecule at a given time). The brackets in Equation indicate an ensemble average 
over paths [€ I . The work done on the system is obtained by integrating in time; we use the work 
definition | given in ^, that is: 

t 

W{t) := -t;p,nVii / {^f - z'-'^{t'))dt' (15) 



There are a variety of techniques for overcoming the difficulty associated with taking averages 
of an exponential process. The stiff-spring approximation 0] requires the work distribution along 
the SMD path to be approximately Gaussian; this has the drawback of needing to experiment 
with different values of /cpuu and Vpuu in order to determine which values this approximately holds 
for in the particular system under study (this can be computationally very demanding in some 



SMD simulations). The cumulant expansion |0] is another alternative, but one still needs to make 
reliable estimates of the low order moments of an ensemble. Attempts at using ML estimation 
on the ensemble of paths to obtain efficient estimation has also been explored in previous works 
[lol |. It should be stressed that all of the aforementioned works focused on making best use of 
a finite set of work paths. The surrogate work paths created by our local diffusion models can 
be treated by these ensemble methods, but the focus of our approach is in approximating the 
statistical structure of the ensemble of SMD processes so that we can then use this structure to 
create additional bootstrapped samples after making observations on a relatively small number of 
)aths. The eventual hope is that the SMD time series associated with interesting "rare events" 
^ will correspond to a small number of coherent structures in our local parameter spaces and 
the information in our collection of surrogate models can be used to approximate the relative 
frequency and the distribution of magnitudes that these paths will contribute to the exponential 
average (though this is highly optimistic, the results of this study indicate this is not unreasonable). 
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V. COMPUTATIONAL DETAILS 



A. SMD Simulation Details 

Throughout this study the NAMD program [l^ ( |http : //www . ks . uiuc . edu/Research/namd/ 1 

was used to generate data. The pulling speed (fpuu) used for all simulations was 0.01^; a step 
size of 2 fs was used to integrate the equations of motion for 1 x 10^ SMD time steps, and the 
spring constant used for pulling (fcpuu) was assigned a value of 7.2 kcal/mol/A^ in all simulations. 
Additional computational details (including parameter files, many of the raw data sets, etc.) can 
)e downloaded from |http : /7www . ks . uiuc . edu/Train ing/T utorials/scieiice/lOAla-tutorieLl| 



B. Data Preparation 

The local models were calibrated using the nonstationary time series that came from the con- 
stant velocity SMD simulations. The net time series generated by the SMD simulations are denoted 
by X = {xjI^Iq. To use local parametric models, the observations were screened by inspecting which 
observations fell within a window centered at a specified Xq (which is the point at which a local 
model is desired). If M^™ (which is random) observations fell within the window, then a new 
series x"™ was constructed consisting of the M^™ pairs {xm,Xm+i) where the first coordinate is 
the observation that fell within the selected window. The screened time series was then passed 
to the scheme that was meant to approximate the log likelihood function (in this paper the SML 
approximation) . 

It should be stressed that the underlying data source only needed to be generated once. After 
the SMD data was created, we were free to try and calibrate any number of local models around 
the time series (this task can easily be distributed across independent processors). After using 
the observations to estimate a global model, the same SMD observations can be used again to 
determine quantitatively how well the data matches the calibrated stochastic model. 
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C. Estimation Details 



4q | searches where the Brownian 



The parameters were obtained by running 50 Nelder-Mead 
paths needed for the SML routine were generated once per search. This was done to minimize the 
variance associated with using a noisy hkelihood expansion technique (each parameter search used 
a different (random) initial parameter guess). After determining the optimal parameter vector, the 
parameters T^rni and Ng^i were increased to determine if the optimal parameter vector changed 
significantly. For the results shown, the values Tsmi = 50 and Nsmi = 350 gave results that did not 
change significantly after increasing the SML likelihood expansion parameters. 



VI. RESULTS 

A. Estimation Results 

Figures 11141 plot the various local model parameters associated with the assumed local model 
estimated at 12 different points using a window width of 2A centered at the values of z indicated 
by the presence of a symbol. The lines connecting the model parameters were constructed using 
MATLAB's smoothing spline function csaps. In all of the aforementioned plots, the dashed lines 
correspond to parameters estimated from data that was sampled from the NAMD simulation every 
50 time steps (this resulted in ~ 2100 observations per window on average) and the solid lines 
correspond to the parameters estimated from data that was sampled from the NAMD simulation 
every 150 time steps (this resulted in ~ 700 observations per window on average). The fact that the 
Langevin thermostat was used to regulate temperature in these simulations should also be explicitly 
noted (both cases plotted in Figures ^El used 7 = 5ps^^). In order to construct the global diffusion 
model from this data, we used the splines that came from piecing together the drift and diffusion 
constants {A and C). This was done because these parameter estimates were observed to have less 
relative noise associated with them due to estimation error in most cases (e.g. see Table QJ). The 
"linear sensitivity" {B and D) coefficients are still useful for determining features of the global 
nonlinear functions. For example note how the location of the local extrema in the global spline 
function can be predicted from the zero crossings of the estimated linear sensitivity components. 
One observes that the qualitative features of the drift and diffusion coefficient functions are similar, 
but the parameters estimated for each individual SMD realization appears to come from a slightly 
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different function (finite sample estimation noise alone should not connect smoothly if the true 
process came from a single scalar diffusion) . Each function appears to be a smooth distortion 
of one underlying (unknown) base function. It should also be noted that the noise parameter is 
significantly affected by the sampling frequency whereas the drift function appears to be relatively 
insensitive to the sampling frequency. Possible causes for the discrepancy will be discussed in the 
goodness-of-fit results reported later. For now it should be noted that the scalar reaction coordinate 
description of the system is itself a drastic oversimplification of the system; in addition it is also 
well known that the dynamics of a reaction coordinate that accounts for only positions and not 
velocities is not necessarily Markovian A SMD simulation that is carried out in the presence 
of explicit solvent has noise which is typically significantly correlated for the observation frequency 
we use to estimate the parameters of a diffusion model with (though the artificial "solvent" in our 
data was modeled using a Langevin thermostat which helped in reducing this type of correlation). 
If the true dynamics of the reaction coordinate are not Markovian then one should be concerned 
about the errors introduced when one tries to use the Jarzynski work relationship to construct a 
PMF using a Markovian surrogate process. The tests given in 2] quantitatively test the statistical 
validity of the simplifying assumptions we impose on the proposed stochastic model and even offer 
insight into the inadequacies of our simplified description and Section IVllI discusses one possible 
way in which we can introduce a more realistic noise process. 

Table^reports Monte Carlo parameters estimates obtained from local estimation using synthetic 
data (from a single global SDE) that was created using a trajectory of the 7 = Sps"-*^ NAMD runs 
(sampling every 50 SMD steps) and then using the estimated single (genuine) SDE to generate 100 
paths with the corresponding (exactly known) global spline coefficient functions. We see that for the 
sampling frequency used that the estimators produce significantly different results. Deterministic 
likelihood approximations are attractive because they can typically be executed with much less 
CPU time, but obtaining a general (reliable) deterministic expansion method for a wide class of 
multivariate, time-inhomogeneous, state-dependent noise cases appears to be beyond our current 
analytical capabilities (however the SML easily can accommodate this case at the cost of being 
computationally inefficient). 
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FIG. 1: The global drift function estimated by a smoothing spline fit of the constants measured 
from sampling of the time series coming out of the constant velocity SMD simulation. The dashed 
lines correspond to the spline obtained (constructed using the estimated local parameter values 
indicated by symbols) by sampling the data every 50 SMD times steps (O.lps) and the solid lines 
correspond to the parameters estimated using the same data sampling every 150 SMD times steps 
(0.3ps). All simulations used a damping coefficient 7 = 5 ps~^ for the Langevin heat bath. 




FIG. 2: The "linear sensitivity" (as measured by piecing together the B parameters through 
smoothing splines) of the global drift function. For computational details, see caption in Figure E 

B. Synthetic Work and Approximating the PMF 

The SDEs corresponding to the splines plotted in the previous figures were then used to gen- 
erate "synthetic" work paths. Each of the 10 estimated SDEs were used to create 100 synthetic 
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FIG. 3: The global diffusion function estimated by a smoothing spline fit of the constants. For 
computational details see caption in Figure E 
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FIG. 4: The "linear sensitivity" (as measured by piecing together the D parameters through 
smoothing splines) of the global diffusion function. For computational details, see caption in 
Figure [H 

realizations of the work paths associated with the SMD stretching experiment. Figure [S] plots the 
gradient of the estimated PMF using the various work paths and directly using the relation given 
in Equation (the inset plots the raw PMF). The dashed lines in the plots correspond to the 
PMF estimated from 100 copies estimated along a single (estimated) diffusion model; the solid 
black line corresponds to that of pooling the work paths from all synthetic trajectories together 
and then estimating the resulting PMF through evaluating the empirical exponential average and 
the solid red line plots the "exact" results j^. 
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TABLE I: MC Parameter Distributions - Local model parameters estimated from 100 realizations 
of a single, known SDE diffusion whose coefficient functions were determined by estimation using 
data from a single SMD simulation (using data sampled every 50 SMD steps with 7 = bps'^). The 
window size was 2A centered at two different values of Xq (the row with the Xq reported contains 
the known "local values" of the spline functions). 





A 




B 


C 




D 


Xo = 14 


8.409 X 10" 


-1 


-7.144 X 10-1 


6.376 X 10- 


-1 


-6.053 X 10-3 


Mean Euler 


9.024 X 10" 


-1 


-8.281 X 10-^ 


5.178 X 10- 


-1 


-5.803 X 10-3 


Std. Euler 


9.965 X 10" 


-2 


1.648 X 10-^ 


7.158 X 10" 


-3 


1.830 X 10-3 


Mean SML 


8.161 X 10^ 


^1 


-6.827 X 10-^ 


6.627 X 10" 


-1 


-5.981 X 10-3 



Std. SML 1.209 X 10-^ 1.767 x 10-^ 1.857 x 10"^ 1.842 x 10-3 



Xo = 22 -.2501 X 10^ 9.051 x lO'^ 5.322 x 10'^ -4.281 x 10-3 



Mean Euler 


-.2474 X 10^ 


9.685 X 10" 


-2 


4.610 X 10-1 


-3.934 X 10-3 


Std. Euler 


1.369 X 10-1 


4.444 X 10" 


-2 


7.290 X 10-3 


2.105 X 10-3 


Mean SML 


-.2465 X 10^ 


8.613 X 10- 


^2 


5.550 X 10-1 


-4.165 X 10-3 


Std. SML 


1.673 X 10-1 


2.523 X 10" 


-2 


1.514 X 10-2 


1.360 X 10-3 



We could easily increase the number of synthetic paths in an attempt to (at a relatively cheap 
computational cost) to create a larger number of synthetic trajectories in an attempt to get the 
large sample sizes needed to safely calculate an exponential average; however, a major point of this 
paper is that it more important to characterize the distribution of SDE coefficient functions than 
it is to sample the work paths that come from a small set of estimated SDEs. How this observation 
can possibly be exploited is discussed in Section IVIII 

The top two plots in Figure IHl show the work distributions associated with the 1000 = 
-^Qlocmod ^ j^QQCopies gyj^thetic paths and the bottom plot shows the work distribution corresponding 
to the same number of genuine SMD simulations. For each distribution plotted, the corresponding 
normal distribution (using the empirically measured mean and standard deviation) is plotted as 
well (using dashed lines). This was done because the stiff-spring approximation 3, ^ requires a 
normal distribution of work. A relatively small deviation from the normality assumption using an 
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exponential average to calculate the free energy can greatly affect the PMF calculation (recall our 
methodology does not require a normal distribution of work). 

The thick dashed-dotted line in Figure IHl represents the "reversible work" corresponding to 
the SMD process at 1.5ns (this time is associated with a target end-to-end distance z = 28A). The 
reversible work was determined from the "exact" PMF . Any value of work (in the histograms 
corresponding to 1.5ns) to the left of this line is considered to be an important rare event (in- 
terpreted physically, this corresponds to a violation of the second law of thermodynamics at the 
atomistic scale jsl). Note that the differences in the shapes of the tails of the actual work distribu- 
tion and those of the synthetic work distributions are significant; this accounts for the discrepancy 
in the PMF estimated using synthetic work paths at this state value. It should also be mentioned 
that the work distribution at this point does include an accumulation of errors from previously 
visited state points. Observe how the shapes of the synthetic distributions deviate more from those 
of the genuine SMD simulations as time progresses (errors are introduced both by finite sample 
estimation errors and inadequacies of the diffusion model) . The biggest source of error in the esti- 
mation of the PMF is likely due to the fact that all bootstrapped samples were given equal weight 
(bootstrapping a surrogate model obtained using data corresponding to a rare event explains the 
appearance of multiple modes in the 1.5ns synthetic work histograms). Possible remedies to these 
and other problems are discussed and demonstrated on a toy model in Section IVIII 



C. Testing the validity of the diffusion approximation 

In the previous section we saw that the PMF estimated from a small set of synthetic SDE 
trajectories could faithfully reproduce the PMF through approximating the work paths needed for 
the exponential average used in the Jarzynski nonequilibrium work relationship (in the relatively 
"simple" deca- alanine system studied). This may just be due to the fortunate fact that in the 
system studied that the work relationship has a certain "robustness" to the process that generates 
the work paths. Here "robustness" refers to a situation where a large collection of significantly 
different stochastic processes generate similar work distributions and the PMF estimated is ne arly 
independent of the differences in the work distributions that come from the various processes |53l |. 

Here we carry out goodness-of-fit tests that investigate quantitatively how well the diffusion 
approximation is given the global splines estimated from a sequence of local models and the actual 
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FIG. 5: The estimated gradient of the PMF obtained by running 100 (bootstrapped) copies of the 
sphne estimated from one SMD trajectory (dashed Unes) and that obtained by pooHng the work 
trajectories (soHd black) and the PMF obtained through a benchmark reversible pulling (solid red) 
experiment [3] (the inset displays the corresponding PMF). 

SMD data. The results depend heavily on the sample size, sampling frequency and on the statistical 
validity of the diffusion approximation. Figures d and [HI plot the sample correlation and histogram 
of {Zm} associated with one realization of the SMD process (using the estimated global nonlinear 
diffusion model with the SML expansion of the transition density). The red lines in both figures 
correspond to ±2(7 of the asymptotic distributions associated with a correctly specified model. 
Four different cases are plotted: one using the data and splines estimated from sampling from the 
SMD process every 150 steps with 7 = 5 ps~^, another sampling every 50 steps with 7 = 50 ps~"^, 
the third sampling every 50 steps with 7 = 5 ps~^ and the fourth that associated with using a 
genuine SDE with the splines estimated from the first trajectory sampled every 50 steps with 7 = 5 
ps~^. The overall shape of the histogram for the empirical SMD cases does not appear too bad 
given the size of the time series used (recall all SMD simulations contained 1 x 10^ time steps). The 
sample autocorrelation function indicates that there is significant serial correlation in the {Z^} 
series when using the actual SMD data. The results from the synthetic SDE trajectory are shown 
because the SML expansion may cause measurable errors in the PIT (the results indicate that these 
approximation errors are negligible in comparison to the imperfections of the assumed model). It 
should be noted that although we only show results from a single trajectory, the results for the 
other trajectories are qualitatively similar for the paths observed (this has interesting implications. 
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FIG. 6: The empirically measured work probability densities (the Gaussian distribution corre- 
sponding to the empirically measured mean and standard deviation is also plotted using dotted 
lines in all figures above). The top plot shows the results corresponding to running 100 (boot- 
strapped) copies for each of the 10 estimated global surrogate diffusion models obtained by sam- 
pling the 10 SMD trajectories every 150 SMD steps (0.3ps) ; the middle those corresponding to 
sampling every 50 SMD steps (O.lps); the bottom plot are those obtained by running 1000 genuine 
SMD trajectories. All simulations used a damping coefficient 7 = 5 ps~^. The work distributions 
plotted correspond to the distributions obtained at times (.4, .6, .8, 1.0, 1.2, 1.5) ns in all cases. 

see Section IV 11(1 . It is also interesting to observe the results obtained when one estimates the 
parameters using a lower damping coefficient and then analyzes the associated PIT series. In 
Figure El we estimate a model by sampling a SMD trajectory every 50 steps and use 7 = 0.1 ps~^ 
(it is likely that the system is underdamped using this damping coefficient and sampling frequency); 
note that the autocorrelation function clearly displays a more oscillatory response in comparison 
to the higher values of 7. 

Table Hll offers further insight into the model inadequacies. The first four columns report the 
computed M{j, m) statistic for a single realization for the various cases listed in the table . The final 
column plots the test-statistic obtained when a truly uniform (on [0, 1]) independent and identically 
distributed (i.i.d.) sequence of random variables is used (using the sample size corresponding to 
sampling every 50 SMD steps) to construct M(j, m). If the model is correctly specified, this 
statistic should be normally distributed with unit variance. We can obviously strongly reject all of 
the "empirical" models (this rejection should not be too discouraging given the fact that we have 
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FIG. 7: The sample correlation function between Z^, and Zm-j (Note: Z corresponds to the 
PIT rv, not the end-to-end distance z). The top figure corresponds to the PIT sequence obtained 
by using a genuine SMD trajectory sampled every 150 SMD time steps using 7 = 5 ps~^ and 
the transition density corresponding to the global model obtained using the same trajectory ; the 
second plot from the top to that of using data from a genuine SMD trajectory sampled every 50 
SMD time steps using 7 = 50 ps~^ and a transition density corresponding to a surrogate model 
using the same trajectory ; the second plot from bottom to that of using a genuine SMD trajectory 
sampled every 50 SMD time steps using 7 = 5 ps~^ and a transition density obtained from the 
same trajectory ; and the bottom to that of using a synthetic SDE trajectory (different from the 
trajectory used for estimation) with a SML approximation of the transition density that actually 
generated the data (obtained using model parameters from estimating along a SMD trajectory 
sampled every 50 SMD time steps using 7 = 5 ps^^). 

a fairly large time series of observations which facilitates rejecting the null hypothesis). 

These results suggest that some of the fast time scale motions of the underlying N-body process 
are still statistically measurable at the sampling frequency used. The fact that the 7 = 50ps~^ case 
only appears to be marginally better than the corresponding 7 = 5ps~^ case indicates that our 
overdamped approximation for the observation frequency used is probably not the major problem 
in our model (this is further supported by the empirical evidence in Figure 1^1). The Zm series (for 
the cases where 7 > 5ps~^) appears to be negatively correlated for a very short time window and for 
longer times it has a positive correlation that decays slowly. These phenomena are more than likely 
caused by the noise introduced by the fast vibrational motion and the slower long range Lennard- 
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FIG. 8: The histogram of Zm corresponding to the global surrogate model . The top plot cor- 
responds to the first SMD trajectory sampled every 150 SMD time steps using 7 = 5 ps~^; the 
next to that the case of sampling every 50 SMD time steps using 7 = 50 ps~^;the third plot to the 
case of sampling every 50 SMD time steps using 7 = 5 ps~^; and the bottom to a synthetic SDE 
trajectory (with splines constructed using the first SMD trajectory sampled every 50 SMD time 
steps and 7 = 5 ps"-*^). 



TABLE II: Computed M-test Statistic ( "ds" denotes the down sampling rate; e.g. ds=50 indicates 
that the time series was constructed using the output of the SMD simulation recorded every 50 
steps) and the ordered pair in the column indicates M{m,j). 



Case 








(1,1) 


(2,1) 


(1,2) 


(2,2) 


(3,3) 


(4,4) 


Empirical (7 = 


= 5ps ^ 


, ds= 


=150) 


20.5337 


19.3370 


19.9771 


18.9426 


16.6009 


14.4330 


Empirical (7 = 


= bOps" 


\ ds 


=50) 


84.0644 


80.1934 


80.8452 


81.6811 


78.1994 


74.2394 


Empirical (7 = 


= 5ps~^ 


, ds= 


=50) 


92.2045 


85.0764 


86.0400 


81.7107 


73.3389 


66.4957 


Synthetic (7 = 


= 5ps-\ 


ds= 


50) 


-0.3584 


-0.1151 


-0.6810 


-0.6483 


-0.9822 


-1.3087 


Uniform i.i.d. 


rv 






-0.7169 


-0.8644 


-0.8615 


-0.8979 


-0.9049 


-0.8484 



Jones type interactions between nonadjacent atoms of the deca-alanine molecule (respectively). The 
fact that M(l, 1) is the greatest in all of the "empirical cases" strongly suggests that something 
is askew with the drift; it is likely that the dynamics of a scalar reaction coordinate are either 
non-Markovian and/or the drift coefficient is more oscillatory than we assumed (our local models 
only yield smooth local functions). In some cases it may be possible to reduce the error associated 
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FIG. 9: The autocorrelation of the PIT associated with actual SMD data corresponding to a 
transition density obtained by estimating the parameters sampling every 50 SMD time steps and 
using 7 = 0.1 ps~^ (using the same trajectory). See Figure [71 for additional details. Note the more 
oscillatory nature of this empirical autocorrelation. 



with the former possibility by adding another reaction coordinate into the model and/or sampling 
the SMD data less frequently; the latter possibility can be tested by using smaller windows for 
the local model. Unfortunately if a very small window is used, the behavior of the small sample 
parameter estimates may become sporadic (increasing the sampling frequency is also problematic 
due to the fact that the Markovian assumption on the dynamics of the reaction coordinate typically 



breaks down at some limit of sampling frequency 
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VII. DISCUSSION AND POSSIBLE EXTENSIONS 

We have already mentioned that in certain systems, the Jarzynski work relationship may not be 
computationally practical when work paths associated with realizations that have a low probability 
cause a large influence on the PMF computed from a finite sample of realizations ^ . Under these 
circumstances, the number of realizations needed to get a good estimate of the exponential average 
used in the work relationship is too large in many systems of interest. This fact fuels interest 
in methods which can infer the basic statistical structure from a small number of actual SMD 
realizations. 

In this paper, we demonstrated a method by which one can approximate the dynamics of a 
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reaction coordinate associated with a single realization by fitting a diffusion SDE (this is then 
repeated using different SMD reahzations). The advantage this offers is that one can associate a 
diffusion model with a configuration drawn from a microscopic ensemble assumed to be at equi- 
librium (recall the Jarzynski work relationship requires the initial configuration to be drawn from 
equilibrium distribution 0]). The diffusion model estimated (which will also depend on the SMD 
parameters) can then be "bootstrapped" by running the SDE using different driving Brownian 
paths in an effort to increase work path sampling. The basic idea being that an interesting rare 
event may be associated with the equilibrium configuration used (or a minor perturbation of it), 
we just did not observe the rare event in the single realization drawn. 

The bootstrapped samples can even be given a loose physical interpretation: Running multiple 
surrogate simulations using different Brownian paths (using a single, fixed, estimated diffusion) 
corresponds roughly to multiple realizations from a stochastic process that describes the dynamics 
associated with certain features of one configuration (e.g. a configuration obtained by recording 
the positions of all of the atoms in the deca-alanine molecule from a previous run, but assigning new 
velocities). The "noise" of the process is meant to represent the influence that various items have 
on the dynamics of the low-dimensional reaction coordinate. Examples of possible noise sources 
include the "internal" neglected degrees of freedom of the molecule of interest (e.g. the dihedral 
ais les), the effects of different initial velocity distributions, and different solvent conflgurations 

This paper also claimed that one might have a collection of diffusion models associated with the 
"pulling dynamics" in certain systems and that the collection has a structure that can possibly be 
exploited. One can hope that in general systems the dynamics of a good set of reaction coordinates 
(whose selection is not always trivial) associated with a single realization from the SMD simulation 
may be describable by a diffusion SDE with deterministic coefficients, however the coefficient 
functions of the SDE may depend very heavily on the details of the underlying initial atomistic 
configuration drawn. If this is the case, then one may be interested in characterizing the distribution 
of coefficient functions in an effort to improve the PMF estimation by using the Jarzynski work 
relationship with synthetic work paths. If one somehow knew the distribution of curves, then one 
could appeal to tools associated with decision theory I21I to improve work path sampling by using 
bootstrapped samples from the collection of associated surrogate diffusion models. 

To make a concrete demonstration of these ideas, we turn to a toy model. Suppose the dynamics 
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of a scalar reaction coordinate of interest are given by: 

dXt = Ki{ai - Xt)dt + ai^/YtdWt (16) 

Where the parameter vector 0j := [ai,Ki,ai] is distributed according to the normal law N{fi,T,) 
where = [a, k, a] is a deterministic vector and S is a deterministic constant symmetric positive 
definite matrix. Suppose we calculate the "exact" PMF associated with this toy model by sampling 
1000 realizations obtained by first drawing a 0j (using an initial state value from the invariant 
distribution associated with Equation using 0j = /i) and then carrying out a constant velocity 
pulling experiment; we then ask: "What can we do to improve the PMF estimation using small 
sample sizes knowing a priori that the we have multiple underlying models that come from a known 
distribution of model parameters?" 

In Figure [TUl the "exact" and estimated PMFs are plotted using 10 and 100 genuine SMD (top 
and bottom plots respectively) paths. The curve labelled "Ref contains the PMF estimated using 
1000 genuine trajectories. The curve labelled "Raw" uses only a subset (10 or 100) of the original 
SMD paths to compute the PMF using a standard empirical average of the exponential random 
variables. The curve labelled "Rpt (unweighted)" plots the result obtained by noting the Gj and 
initial condition of each of the "Raw" trajectories and running additional simulations to create more 
work paths (100 and 10 respectively). The PMF is then computed by taking standard exponential 
averages with the additional trajectories. In the curves labelled "Rpt (weighted)" we do the same 
except in the exponential average we assign weights determined by the observed 0j and its known 
distribution (the density determines the relative weight and this is used to create a linear weighting 
coefficient). It should be noted in this toy example we assume perfect parameter estimation (which 
in practice we do not usually have the luxury of) and exact knowledge of the underlying distribution 
associated with the 0j random variables (which we will likely never have). The weighting scheme 
used is done for illustrative purposes only (in 10] "optimal" weighting schemes are discussed) 

The aim of this example was to demonstrate that in small samples, there can be a significant 
benefit to estimating the parameters associated with a single trajectory and rerunning a synthetic 
trajectory to aid in (approximate) work path sampling. The demonstration also showed that 
a simple weighting scheme could be of assistance in a problem that satisfies our assumptions 
(in complex systems we can only hope that the simple assumptions made about the toy model 
approximately carry over). The example also shows that the largest benefit would come from a 
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FIG. 10: The top figure was obtained using 10 genuine SMD trajectories and the bottom plot was 
obtained using 100 trajectories. The difference between the curves is discussed in section IVlII 

sampUng technique that would allow us to accurately draw from the distribution associated with 
the Gj random variables and then run a larger batch of synthetic SDEs. Of course estimating 
this multivariate distribution with a small sample size with no prior information is a very hard 
task, but in the deca-alanine system studied it appears that the local parameters are all drawn 
from a single connected region in parameter space which may be well approximated by a simple 
distribution like a multivariate normal (this is just speculation). 

In the future, the ideas laid out in this study should be applied to other more complicated 
systems to see if the coefficient functions of the SDE used to approximate the dynamics of the 
reaction coordinate seem to come from a small family of functions. In complicated systems, we 
will probably not be lucky enough to observe a single family of functions. The hope is that a 
small number of easily discernable families emerge and that these families can be approximated by 



and growth curve analysis 



An 



employing the help of those involved with empirical Bayes 
interesting twist encountered here is in how to utilize the information obtained about neighboring 
local models into making decisions about a specific local model. 

Success in this type of research endeavor would possibly allow us to generate the large number 
of samples needed to accurately sample rare work events by using bootstrapped samples from a 
family of synthetic surrogate stochastic processes. This may result in an approximation scheme 
that significantly reduces the computational load typically needed to generate the work paths by 
using standard SMD simulations (furthermore it is likely that the local models and parameters 
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characterizing the local models depend smoothly on system parameters like temperature). 

Characterizing the distribution of curves may be too computationally difficult in some situa- 
tions. However the structure contained in the models we fit can be utilized to test a variety of 
other hypothesis given the model and the data (this information can then be used to gain useful 
quantitative insight about the system). For example the Markov property can be tested utilizing 
the transition density of the assumed diffusion model (or any other stochastic process that allows 
one a method to evaluate the associated transition density). Also, the goodness-of-fit improvement 
obtained by adding other reaction coordinates or a more complicated model of the heat bath into 
the surrogate model can be quantified by using the basic idea behind the approach presented (for 
example in the deca-alanine example we may want to monitor the dynamics of the radii of gyration 
of the molecule in addition to the linear end-to-end distance and/or use a jump process to model 
the breaking of important hydrogen bonds). The type of diagnostics discussed above can be used 
to (quantitatively) approximate the timescale at which a diffusion model is valid and/or determine 
the smallest number of reaction coordinates needed to write a meaningful coarse-grained dynamic 
model. 

In regards to multiscale modeling, it is interesting to note that Figure ^2 makes it appear 
that the conditional residuals (:= ^j_|_At ~ [^t+At l"'^*]) ' where the superscripts are meant to 
distinguish between global diffusion models and paths associated with different SMD realizations) 
come from a single stationary process. The empirically measured autocorrelation function of the 
series above are all similar and the differences between the autocorrelation appear to be random 
(indicating that a single underlying autocorrelation function would be a useful approximation 
of the conditional residual process). This suggests that the key differences in the equilibrium 
configurations (e.g. different positions of the deca-alanine atoms) used to start an SMD simulation 
determine the coefficient functions of the different diffusion processes, but the deviations from the 
collection of diffusion models are due to the fast time scale motions "contaminating" the diffusions 
in a "statistically similar" fashion irrespective of the details of the initial configuration. To get 
more realistic surrogate models one could appeal to some well established time series procedures 



(ARMA,ARCH, or GARCH modeling 84]) to model the contamination processes (which could 
possibly use data pooled together from different time series) and then utilize the collection of 
oversimplified diffusion models to estimate the conditional expectations yielding a "two-scale" type 
approximation which can possibly be used to incorporate a more realistic noise into the surrogate 
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FIG. 11: The empirically measured autocorrelation function is plotted for the conditional residu- 
als := Xl_^^^ — ¥F'^i[Xl_^^^\Xl] where the superscripts are meant to distinguish between diffusion 
models and paths associated with different SMD realizations (here results from five different SMD 
realizations are plotted). The straight lines correspond to the 95% confidence bands under the 
assumption of no correlation. The top inset plots the time series of X^^^^^ — FF'^i [X^^^^lXl] for a 
single realization. The series appears to come from a stationary distribution (making the autocor- 
relation functions measured over the entire time series meaningful). The bottom inset zooms in on 
the autocorrelation function in order to show that systematic difference between the time series do 
not seem to "persist" indicating the the deviations from the diffusion process can be modeled as a 
single stochastic process. 

model; this would also be an interesting direction to investigate in the future. 

VIII. CONCLUSIONS 

We have demonstrated a methodology by which one can use local diffusion modeling in order 
to construct a global nonlinear SDE which can be used to approximate the time series that comes 
out of a SMD process. It was also shown that the diffusion approximation estimated in the deca- 
alanine example studied appears to come from a single family of diffusion models (the specific 
family was demonstrated to depend on the sampling frequency used to calibrate the models). 
The statistical validity of the surrogate models was tested using the PIT and it was shown that 
the simple diffusion process appears to be a reasonable (albeit strongly rejected by large sample 
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hypothesis tests) approximation of the true process and it was demonstrated that the surrogate 
models capture many of the sahent features needed to reproduce a PMF. Some extensions that 
could utilize the knowledge of researchers involved with empirical Bayes methods and growth curve 
analysis were also discussed. 
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